Collaboration

No one outside the group

Exploratory Data Analysis

Information on the competition can be found here.

First, to get a general idea of the dataset, let’s make a correlation plot to examine the relationships between the variables, along with the correlation of each of the eight predictor variables with the count variable.

##                    [,1]
## season      0.163439017
## holiday    -0.005392984
## workingday  0.011593866
## weather    -0.128655201
## temp        0.394453645
## atemp       0.389784437
## humidity   -0.317371479
## windspeed   0.101369470
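The correlation column above can be reproduced with `cor()` applied column by column. A minimal sketch, using a small synthetic stand-in for the `train` data frame so the snippet runs on its own (the variable values here are invented, not the competition data):

```r
# Synthetic stand-in for the training data (assumption: the real analysis
# uses the Kaggle 'train' data frame instead).
set.seed(1)
n <- 200
demo <- data.frame(
  temp      = runif(n, 0, 40),
  humidity  = runif(n, 20, 100),
  windspeed = runif(n, 0, 30)
)
demo$count <- 5 * demo$temp - 2 * demo$humidity + rnorm(n, sd = 30)

# Correlation of each predictor with the response, as in the column above:
cors <- sapply(demo[, c("temp", "humidity", "windspeed")], cor, y = demo$count)
round(cors, 3)
```

On the real data, replacing `demo` with `train` and listing all eight predictors yields the table shown.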

It appears that none of the continuous variables have a strikingly clear relationship with the target variable. There appears to be useful explanatory information in some of the factor variables, but largely only in the upper range of the data (see box plots).

Then, let’s look at the spread of the response variable, ‘count’.

ggplot(data = train, aes(x = count)) + geom_density(adjust = 0.77) + labs(title = "Distribution of Count", x = "Count", y = "Density")

fav_stats(train$count)
##  min Q1 median  Q3 max     mean       sd     n missing
##    1 42    145 284 977 191.5741 181.1445 10886       0

The density plot is fairly right-skewed, with most of the values concentrated well below the mean (the median of 145 sits under the mean of about 192).
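One way to quantify that right skew, and to motivate the log transform used in the model below, is the sample skewness. A sketch with synthetic right-skewed counts (the exponential draw is a stand-in for `train$count`, not the actual data):

```r
# Sample skewness: positive for a right-skewed distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Synthetic right-skewed values with a mean near the observed 191.6.
set.seed(3)
x <- rexp(1000, rate = 1 / 190)

skewness(x)           # strongly positive
skewness(log(x + 1))  # the log transform pulls the skew toward zero
```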

We can also visualize the relationship between our response variable and three selected predictors, to see how bike rentals are affected by temperature, humidity, and whether the day in question is a holiday.

train$holiday1 <- 0
train$holiday1[which(train$holiday == 0)] <- 'Not Holiday'
train$holiday1[which(train$holiday == 1)] <- 'Holiday'
train$holiday <- as.factor(train$holiday)
p <- plot_ly(train, x = ~temp, y = ~humidity, z = ~count, color = ~holiday1, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Temperature (Celsius)'),
                      yaxis = list(title = 'Humidity (%)'),
                      zaxis = list(title = 'Count')))
p

Model Fit

We chose a model that uses windspeed, temperature, and humidity as the three predictors of count. Justification:

In the correlation table above, these three variables (along with atemp) had the strongest apparent relationships with “count”. Windspeed was used instead of adjusted temperature (atemp) to avoid the obvious collinearity between temp and atemp.
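The temp/atemp redundancy is easy to confirm with a single correlation. A sketch with synthetic values, under the assumption that atemp (“feels like” temperature) is essentially temp plus a small amount of noise:

```r
# Synthetic illustration: "feels like" temperature closely tracks the
# actual temperature (assumption; the real check would use train$temp
# and train$atemp).
set.seed(2)
temp  <- runif(500, 0, 40)
atemp <- temp + rnorm(500, sd = 1.5)

cor(temp, atemp)  # near 1: including both predictors adds little information
```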

The following fits the model on the training set; it is then applied to the test set via predict():

train <- read_csv("data/train.csv")
model1<- lm(log(count)~ windspeed+temp+humidity, data=train)
summary(model1)
## 
## Call:
## lm(formula = log(count) ~ windspeed + temp + humidity, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5604 -0.6476  0.2639  0.8750  3.2963 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.511861   0.065391  68.998  < 2e-16 ***
## windspeed    0.005989   0.001616   3.706 0.000212 ***
## temp         0.067915   0.001609  42.216  < 2e-16 ***
## humidity    -0.022783   0.000687 -33.162  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 10882 degrees of freedom
## Multiple R-squared:  0.2334, Adjusted R-squared:  0.2332 
## F-statistic:  1104 on 3 and 10882 DF,  p-value: < 2.2e-16
mplot(model1)[1:2]
(The first two diagnostic plots for model1, residuals vs. fitted and normal Q-Q, appear here.)

Test_predict <- predict(model1, newdata = test)

Test_predict_table <- tibble(Test_predict) %>% 
  mutate(count = exp(Test_predict)) %>% 
  mutate(datetime = sample_submission$datetime) %>%
  select(datetime, count)

All of the predictors in this model are statistically significant. That said, most variants of this model (including those with log and/or square-root transformations of the response) violate the equal-variance and normality-of-errors conditions.
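Those conditions can also be probed numerically on the residuals. A sketch of the kind of check involved, using a synthetic fit in place of model1 so the snippet is self-contained (the data-generating process here is invented):

```r
# Synthetic data with multiplicative noise, mimicking a log-response model.
set.seed(4)
x <- runif(300, 0, 40)
y <- exp(0.05 * x + rnorm(300, sd = 0.9))

fit <- lm(log(y) ~ x)
res <- resid(fit)

shapiro.test(res)$p.value   # formal test of residual normality
cor(abs(res), fitted(fit))  # rough probe for non-constant residual spread
```

On the report's actual model, `fit` would be replaced by `model1`; a small Shapiro-Wilk p-value or a strong |residual|-vs-fitted correlation would corroborate the violations noted above.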

The following code is intended for use in a re-submission with cross-validation:

set.seed(3400)
k <- 5
nrow(train) / k
## [1] 2177.2
Fold_train <- train %>% 
  sample_frac(1) %>% 
  mutate(fold = rep(1:k, len = nrow(train)))

indices <- sample(nrow(train), nrow(train), replace = FALSE)
# Split the shuffled row indices into 5 roughly equal groups.
splits <- split(indices, ceiling(seq_along(indices) / ceiling(nrow(train) / 5)))
d1a <- train[splits[[1]], ]
d1b <- train[splits[[2]], ]
d1c <- train[splits[[3]], ]
d1d <- train[splits[[4]], ]
d1e <- train[splits[[5]], ]
folds <- list(d1a, d1b, d1c, d1d, d1e)
rmses <- numeric(5)
for (i in 1:5) {
  fit <- lm(count ~ temp + workingday + weather,
            data = subset(Fold_train, fold != i))  # fit on the other 4 folds
  holdout <- subset(Fold_train, fold == i)         # evaluate on the held-out fold
  errors <- predict(fit, newdata = holdout) - holdout$count
  rmses[i] <- sqrt(mean(errors^2))
}
rmses

Create Submission File

write.csv(Test_predict_table, "submission.csv", row.names = F)

FINAL CHECK

- every plot has all labels (including units)
- code chunks are hidden in the output
- our names + date etc. are all up there